These materials are adapted from a course developed at Cancer Research UK Cambridge Institute by Mark Dunning, Matthew Eldridge and Thomas Carroll.
R basics
Advantages of R
The R programming language is now recognised beyond the academic community as an effective solution for data analysis and visualisation. Notable users of R include:-
Access to existing visualisation / statistical tools
Flexibility
Visualisation and interactivity
Add-ons for many fields of research
Facilitating Reproducible Research
Two Biostatisticians (later termed ‘Forensic Bioinformaticians’) from M.D. Anderson used R extensively during their re-analysis and investigation of a Clinical Prognostication paper from Duke. The subsequent scandal put Reproducible Research at the forefront of everyone’s mind.
Keith Baggerly’s talk on the subject is highly recommended.
Support for R
Online forums such as Stack Overflow regularly feature R
Once installed, you should be able to launch RStudio by clicking on its icon:-
Entering commands in R
The traditional way to enter R commands is via the Terminal, or using the console in RStudio (bottom-left panel when RStudio opens for the first time).
However, this doesn’t automatically keep track of the steps you did
Alternatively, an R script can be used to keep a record of the commands you used.
For this course we will use a relatively new feature called R markdown.
An R markdown mixes plain text with R code
The R code can be run from inside the document and the results are displayed directly underneath
Each chunk of R code looks something like this.
Each line of R can be executed by clicking on the line and pressing CTRL and ENTER
Or you can press the green triangle on the right-hand side to run everything in the chunk
Try this now!
print("Hello World")
You can add R chunks by pressing CTRL + ALT + I
or using the Insert menu option
(can also include code from other languages such as Python or bash)
try and avoid adding code chunks manually
Getting started
At a basic level, we can use R as a calculator to compute simple sums with the +, -, * (for multiplication) and / (for division) symbols.
2 + 2
[1] 4
2 - 2
[1] 0
4 * 3
[1] 12
10 / 2
[1] 5
The answer is displayed at the console with a [1] in front of it. The 1 inside the square brackets is a place-holder to signify how many values were in the answer (in this case only one). We will talk about dealing with lists of numbers shortly…
In the case of expressions involving multiple operations, R respects the BODMAS system to decide the order in which operations should be performed.
2 + 2 *3
[1] 8
2 + (2 * 3)
[1] 8
(2 + 2) * 3
[1] 12
R is capable of more complicated arithmetic such as trigonometry and logarithms; like you would find on a fancy scientific calculator. Of course, R also has a plethora of statistical operations as we will see.
pi
[1] 3.141593
sin(pi/2)
[1] 1
cos(pi)
[1] -1
tan(2)
[1] -2.18504
log(1)
[1] 0
We can only go so far with performing simple calculations like this. Eventually we will need to store our results for later use. For this, we need to make use of variables.
Variables
A variable is a letter or word which takes (or contains) a value. We use the assignment ‘operator’, <- to create a variable and store some value in it.
x <- 10
x
[1] 10
myNumber <- 25
myNumber
[1] 25
We also can perform arithmetic on variables using functions:
sqrt(myNumber)
[1] 5
We can add variables together:
x + myNumber
[1] 35
We can change the value of an existing variable:
x <- 21
x
[1] 21
We can set one variable to equal the value of another variable:
x <- myNumber
x
[1] 25
We can modify the contents of a variable:
myNumber <- myNumber + sqrt(16)
myNumber
[1] 29
When we are feeling lazy we might give our variables short names (x, y, i…etc), but a better practice would be to give them meaningful names. There are some restrictions on creating variable names. They cannot start with a number or an underscore, or contain spaces or characters such as -; letters, numbers, . and _ are all allowed. Naming variables the same as in-built functions or values in R, such as c, T, mean, should also be avoided.
Naming variables is a matter of taste. Some conventions exist such as separating words with _ or . or using camelCaps. Whatever convention you decide on, stick with it!
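As a quick illustration of these naming rules, here is a minimal sketch. The variable names are invented for the example; R accepts letters, numbers, . and _ in a name, but not a leading number or a - character.

```r
# Three valid naming styles (the names are purely illustrative)
patientAge  <- 42   # camelCaps
patient_age <- 42   # words separated by an underscore
patient.age <- 42   # words separated by a dot

# Invalid names: uncommenting either line produces an error
# 2ndPatient <- 42    # starts with a number
# patient-age <- 42   # "-" is read as subtraction
```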
Functions
Functions in R perform operations on arguments (the input(s) to the function). We have already used:
sin(x)
[1] -0.1323518
This returns the sine of x. In this case the function has one argument: x. Arguments are always contained in parentheses, i.e. curved brackets (), and separated by commas.
Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order). The names of the arguments are determined by the author of the function and can be found in the help page for the function. When testing code, it is easier and safer to name the arguments. seq is a function for generating a numeric sequence from and to particular numbers. Type ?seq to get the help page for this function.
seq(from = 3, to = 20, by = 4)
[1] 3 7 11 15 19
seq(3, 20, 4)
[1] 3 7 11 15 19
Arguments can have default values, meaning we do not need to specify values for these in order to run the function.
rnorm is a function that will generate a series of values from a normal distribution. In order to use the function, we need to tell R how many values we want
## this will produce a random set of numbers, so everyone will get a different set of numbers
rnorm(n=10)
The normal distribution is defined by a mean (average) and standard deviation (spread). However, in the above example we didn’t tell R what mean and standard deviation we wanted. So how does R know what to do? All arguments to a function and their default values are listed in the help page.
(N.B. sometimes help pages can describe more than one function)
?rnorm
In this case, we see that the defaults for mean and standard deviation are 0 and 1. We can change the function to generate values from a distribution with a different mean and standard deviation using the mean and sd arguments. It is important that we get the spelling of these arguments exactly right, otherwise R will give an error message, or (worse?) do something unexpected.
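A minimal sketch of overriding the defaults follows. The values 10 and 2 are arbitrary choices for illustration, and set.seed is only there to make the random numbers repeatable. The second call deliberately misspells an argument name to show that rnorm reports an error rather than ignoring it.

```r
# Five values from a normal distribution with mean 10 and standard deviation 2
set.seed(42)  # optional: makes the random numbers repeatable
values <- rnorm(n = 5, mean = 10, sd = 2)
values

# A misspelt argument name ("maen") is not accepted; capture the error message
result <- tryCatch(rnorm(n = 5, maen = 10),
                   error = function(e) conditionMessage(e))
result
```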
In the examples above, seq and rnorm were both outputting a series of numbers, which is called a vector in R and is the most-fundamental data-type.
Exercise
What is the value of pi to 3 decimal places?
see the help for round: ?round
How can we create a sequence from 2 to 20 comprised of 5 equally-spaced numbers?
check the help page for seq ?seq
Create a variable containing 1000 random numbers with a mean of 2 and a standard deviation of 3
what is the maximum and minimum of these numbers?
what is the average?
HINT: see the help pages for functions min, max and mean
Packages in R
So far we have used functions that are available with the base distribution of R; the functions you get with a clean install of R. The open-source nature of R encourages others to write their own functions for their particular data-type or analyses.
Packages are distributed through repositories. The most-common ones are CRAN and Bioconductor. CRAN alone has many thousands of packages.
The Packages tab in the bottom-right panel of RStudio lists all packages that you currently have installed. Clicking on a package name will show a list of functions that are available once that package has been loaded. The library function is used to load a package and make its functions / data available in your current R session. You need to do this every time you load a new RStudio session.
## tidyverse is a collection of packages for data manipulation and visualisation
library(tidyverse)
There are functions for installing packages within R. If your package is part of the main CRAN repository, you can use install.packages
We will be using the tidyverse R package in this practical. To install it, we would run:
install.packages("tidyverse")
A package may have several dependencies; other R packages from which it uses functions or data types (re-using code from other packages is strongly encouraged). If this is the case, the other R packages will be located and installed too.
So long as you stick with the same version of R, you won’t need to repeat this install process.
Dealing with data
We are going to explore some of the basic features of R using data from the gapminder project, which have been bundled into an R package. These data give various indicator variables for different countries around the world (life expectancy, population and Gross Domestic Product). We have saved these data as a .csv file to demonstrate how to import data into R.
You can download these data here. Right-click the link and save to somewhere on your computer that you wish to work from.
The tidyverse is an eco-system of packages that provide a consistent, intuitive system for data manipulation and visualisation in R.
The working directory
Like other software (Word, Excel, Photoshop….), R has a default location where it will save files to and import data from. This is known as the working directory in R. You can query what R currently considers its working directory by doing:-
getwd()
N.B. Here, a set of open and closed brackets () is used to call the getwd function with no arguments.
We can also list the files in this directory with:-
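A minimal sketch, assuming base R's list.files function is the one intended here. Called with no arguments, it returns the names of the files in the current working directory.

```r
# List the files in the current working directory
files <- list.files()
files
```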
Any .csv file in the working directory can be imported into R by supplying the name of the file to the read.csv function and creating a new variable to store the result. A useful sanity check is the file.exists function, which will print TRUE if the file can be found in the working directory.
file.exists("gapminder.csv")
[1] TRUE
If the file we want to read is not in the current working directory, we will have to write the path to the file; either relative to the current working directory (e.g. the directory “up” from the current working directory, or in a sub-folder), or the full path. In an interactive session, you can use file.choose to open a dialogue box. The path to the file will then be displayed in R.
myfile <- file.choose()
myfile
data <- read.csv(myfile)
Assuming the file can be found, we can use read_csv to import it. Other functions can be used to read tab-delimited files (read_delim), and base R provides the generic read.table function. A data frame object is created.
Parsed with column specification:
cols(
country = col_character(),
continent = col_character(),
year = col_integer(),
lifeExp = col_double(),
pop = col_integer(),
gdpPercap = col_double()
)
The data frame object in R allows us to work with “tabular” data, like we might be used to dealing with in Excel, where our data can be thought of having rows and columns. The values in each column have to all be of the same type (i.e. all numbers or all text).
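To see an import from start to finish without needing the course file, here is a minimal sketch. It assumes the readr package is installed, and the file contents are made-up toy values written to a temporary file.

```r
library(readr)

# Write a tiny made-up .csv file to a temporary location
toy_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,lifeExp",
             "CountryA,2002,50.5",
             "CountryB,2002,75.25"),
           toy_path)

# Import it; the result is a data frame with one row per data line
toy <- read_csv(toy_path)
toy

# Every value within a column has the same type
class(toy$country)
class(toy$lifeExp)
```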
In RStudio, you can view the contents of the data frame we have just created. This is useful for interactive exploration of the data, but not so useful for automation, scripting and analyses.
View(gapminder)
We should always check the data frame that we have created. Sometimes R will happily read data using an inappropriate function and create an object without raising an error. However, the data might be unusable. Consider:-
test <- read_table("gapminder.csv")
Parsed with column specification:
cols(
`"country","continent","year","lifeExp","pop","gdpPercap"` = col_character()
)
View(test)
We are going to use the dplyr package to manipulate the data frame we have just created. It is perfectly possible to work with data frames using the functions provided as part of “base R”. However, many find it easier to read and write code using dplyr.
There are many more functions available in dplyr than we will cover today. An overview of all functions is given in the following cheatsheet.
Selecting columns
We can access the columns of a data frame by knowing the column name using the select function. The column names that we want to display are listed after the name of the data frame, separated by commas.
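A minimal sketch of selecting columns by name. It assumes the dplyr and gapminder packages are installed; the gapminder package supplies a ready-made gapminder data frame that stands in for the imported .csv file.

```r
library(dplyr)
library(gapminder)  # assumption: provides the `gapminder` data frame

# Keep only the country and continent columns
country_continent <- select(gapminder, country, continent)
country_continent
```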
We can also suppress some columns from appearing in the output by putting a - in front of the column name.
select(gapminder, -country)
A range of columns can be selected by the : operator.
select(gapminder, lifeExp:gdpPercap)
There are a number of helper functions that can be employed if we are unsure about the exact name of the column.
select(gapminder, starts_with("life"))
select(gapminder, contains("pop"))
Restricting rows with filter
So far we have been returning all the rows in the output. We can use what we call a logical test to define what rows are displayed in the output. This is a test that gives either a TRUE or FALSE result. When applied to subsetting, only rows with a TRUE result get returned.
For example we could compare the lifeExp variable to 40. Internally, R creates a vector of TRUE or FALSE; one for each row in the data frame. This is then used to decide which rows to display.
filter(gapminder, lifeExp < 40)
Testing for equality can be done using ==. This will only give TRUE for entries that are exactly the same as the test string.
filter(gapminder, country == "Zambia")
N.B. For partial matches, the grepl function and / or regular expressions (if you know them) can be used.
filter(gapminder, grepl("land", country))
There are a couple of ways of testing for more than one text value. The first uses an or (|) statement, i.e. testing if the value of country is Zambia or the value is Zimbabwe. Remember to use a double = sign to test for string equality; ==.
The %in% function is a convenient function for testing which items in a vector correspond to a defined set of values.
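A minimal sketch of %in%, first on a small vector and then inside filter. It assumes the dplyr and gapminder packages are installed, with the package's gapminder data frame standing in for the imported file.

```r
library(dplyr)
library(gapminder)

# TRUE for each element on the left that appears in the set on the right
matches <- c("Zambia", "Norway") %in% c("Zambia", "Zimbabwe")
matches

# Keep the rows belonging to either country
two_countries <- filter(gapminder, country %in% c("Zambia", "Zimbabwe"))
two_countries
```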
We can require that both tests are TRUE by using an and (&) operation, e.g. which years in Zambia had a life expectancy less than 40.
filter(gapminder, country == "Zambia" & lifeExp < 40)
filter(gapminder, country == "Zambia", lifeExp < 40)
To allow either test to be TRUE, we use the or (|) operator:
filter(gapminder, country == "Zambia" | country == "Zimbabwe")
Finally, we have != for testing if something is not equal
filter(gapminder, continent != "Europe")
Exercise
Create a subset of the data where the population is less than a million in the year 2002
Create a subset of the data where the life expectancy is greater than 80 in the year 2002
Create a subset of the European data where the life expectancy is greater than 80 in either the year 2002 or 2007
As well as selecting existing columns in the data frame, new columns can be created using the mutate function. We would typically use a function that would take an existing column and apply some operation to each entry in the column in turn. In other words, the number of values returned by the function must be the same as the number of input values.
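A minimal sketch of mutate, assuming the dplyr and gapminder packages are installed. The new column name PopInMillions is invented for the example; dividing by 1e6 returns one value per row, as mutate requires.

```r
library(dplyr)
library(gapminder)

# Add a column giving the population in millions
gapminder_millions <- mutate(gapminder, PopInMillions = pop / 1e6)
gapminder_millions
```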
The whole data frame can be re-ordered according to the values in one column using the arrange function. So to order the table according to population size:-
arrange(gapminder,pop)
The default is smallest --> largest but we can change this using the desc function
arrange(gapminder,desc(pop))
arrange also works on character vectors
arrange(gapminder, desc(country))
We can even order by more than one condition
arrange(gapminder, year,pop)
A final point on data frames is that we can export them out of R once we have done our data processing.
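A minimal sketch of exporting a data frame, assuming the dplyr, readr and gapminder packages are installed and that readr's write_csv is a suitable export function. A temporary file is used so that nothing is overwritten; in practice you would supply your own file name.

```r
library(dplyr)
library(readr)
library(gapminder)

zambia <- filter(gapminder, country == "Zambia")

# Write the data frame to a (temporary) .csv file
out_path <- tempfile(fileext = ".csv")
write_csv(zambia, out_path)

# Check that the file was created
file.exists(out_path)
```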
We will now try an exercise that involves using several steps of these operations
Exercise
Filter the data to include just observations from the year 2002
Order the table by increasing life expectancy
Remove the year column from the resulting data frame
Write the data frame out to a file
“Piping”
We will often need to perform an analysis, or clean a dataset, using several dplyr functions in sequence. e.g. filtering, mutating, then selecting columns of interest (possibly followed by plotting - see later).
If we wanted to filter our results to just Europe there is no point displaying the continent column in our output, so we don’t need to show it.
The following is perfectly valid R code, but invites the user to make mistakes when writing it. We also have to create multiple copies of the same data frame.
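A sketch of what such non-piped code might look like, assuming the dplyr and gapminder packages are installed. The variable names are invented for the example.

```r
library(dplyr)
library(gapminder)

# Option 1: store each intermediate result in its own variable
europe <- filter(gapminder, continent == "Europe")
europe_no_continent <- select(europe, -continent)

# Option 2: nest the calls, which must be read from the inside out
europe_nested <- select(filter(gapminder, continent == "Europe"), -continent)
```

Both options give the same table; the first leaves extra copies of the data in the session and the second becomes hard to read as more steps are added.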
Those familiar with Unix may recall that commands can be joined with a pipe; |
In R, dplyr commands can be linked together to form a workflow. The symbol %>% is pronounced then. With a %>% the input to a function is assumed to be the output of the previous line. All the dplyr functions that we have seen so far take a data frame as an input and return an altered data frame as an output, so are amenable to this type of programming.
The example we gave of filtering just the European countries and removing the continent column becomes:-
Notice that in the select statement we don’t need to specify the name of the data frame
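A sketch of the piped form, assuming the dplyr and gapminder packages are installed. The output of filter is passed by %>% as the first argument of select.

```r
library(dplyr)
library(gapminder)

# Filter to Europe, then drop the continent column
europe_piped <- filter(gapminder, continent == "Europe") %>%
  select(-continent)
europe_piped
```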
The R language has extensive graphical capabilities.
Graphics in R may be created by many different methods including base graphics and more advanced plotting packages such as lattice.
The ggplot2 package was created by Hadley Wickham and provides an intuitive plotting system to rapidly generate publication quality graphics.
ggplot2 builds on the concept of the “Grammar of Graphics” (Wilkinson 2005, Bertin 1983) which describes a consistent syntax for the construction of a wide range of complex graphics by a concise description of their components.
Why use ggplot2?
The structured syntax and high level of abstraction used by ggplot2 should allow for the user to concentrate on the visualisations instead of creating the underlying code.
On top of this central philosophy ggplot2 has:
Increased flexibility over many plotting systems.
An advanced theme system for professional/publication level graphics.
Large developer base – Many libraries extending its flexibility.
Large user base – Great documentation and active mailing list.
It is always useful to think about the message you want to convey and the appropriate plot before writing any R code. Resources like this should help.
With some practice, ggplot2 makes it easier to go from the figure you are imagining in your head (or on paper) to a publication-ready image in R.
As with dplyr, we won’t have time to cover all details of ggplot2. This is however a useful cheatsheet that can be printed as a reference.
Basic plot types
A plot in ggplot2 is created with the following type of command
What type of graph we want to use (The geom to use).
Let’s say that we want to explore the relationship between GDP and Life Expectancy. We might start with the hypothesis that richer countries have higher life expectancy. A sensible choice of plot would be a scatter plot with GDP on the x-axis and life expectancy on the y-axis.
The first stage is to specify our dataset
library(ggplot2)
ggplot(data = gapminder)
For the aesthetics, as a bare minimum we will map the gdpPercap and lifeExp to the x- and y-axis of the plot
That created the axes, but we still need to define how to display our points on the plot. As we have continuous data for both the x- and y-axis, geom_point is a good choice.
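Putting the three pieces together (data, aesthetic mapping and geom), a minimal sketch of the scatter plot might look like this. It assumes the ggplot2 and gapminder packages are installed.

```r
library(ggplot2)
library(gapminder)

# Data + aesthetic mapping + geom
p <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
p
```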
The geom we use will depend on what kind of data we have (continuous, categorical etc)
geom_point() - Scatter plots
geom_line() - Line plots
geom_smooth() - Fitted line plots
geom_bar() - Bar plots
geom_boxplot() - Boxplots
geom_jitter() - Jitter to plots
geom_histogram() - Histogram plots
geom_density() - Density plots
geom_text() - Text to plots
geom_errorbar() - Errorbars to plots
geom_violin() - Violin plots
Boxplots are commonly used to visualise the distributions of continuous data. We have to use a categorical variable on the x-axis. In the case of the gapminder data we might have to persuade ggplot2 that the year column is a factor rather than numerical data.
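A minimal sketch of such a boxplot, assuming the ggplot2 and gapminder packages are installed. Wrapping year in as.factor is one way to make ggplot2 treat it as categorical.

```r
library(ggplot2)
library(gapminder)

# One box per year, showing the distribution of life expectancy
p_box <- ggplot(gapminder, aes(x = as.factor(year), y = lifeExp)) +
  geom_boxplot()
p_box
```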
Create a subset of the gapminder data frame containing just the rows for your country of birth
Has there been an increase in life expectancy over time?
visualise the trend using a scatter plot (geom_point), line graph (geom_line) or smoothed line (geom_smooth).
Customising the plot appearance
Our plots are a bit dreary at the moment, but one way to add colour is to add a col argument to the geom_point function. The value can be any of the pre-defined colour names in R. These are displayed in this handy online reference. Red, Green, Blue or hex values can also be given.
However, a powerful feature of ggplot2 is that colours are treated as aesthetics of the plot. In other words, we can use a column in our dataset to determine the colours.
Let’s say that we want points on our plot to be coloured according to continent. We add an extra argument to the definition of aesthetics to define the mapping. ggplot2 will even decide on colours and create a legend for us.
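A minimal sketch contrasting a fixed colour with a colour mapped from a column, assuming the ggplot2 and gapminder packages are installed. In the first plot col is an argument to geom_point; in the second it sits inside aes.

```r
library(ggplot2)
library(gapminder)

# Fixed colour: every point is red
p_red <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(col = "red")

# Mapped colour: one colour per continent, with a legend added automatically
p_continent <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, col = continent)) +
  geom_point()

p_red
p_continent
```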
One very useful feature of ggplot2 is faceting. This allows you to produce plots subset by variables in your data. In the scatter plot above, it was quite difficult to see if the relationship between GDP and life expectancy was the same for each continent. To overcome this, we would like to see a separate plot for each continent.
To facet our data into multiple plots we can use the facet_wrap or facet_grid function and specify the variable we split by.
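A minimal sketch of faceting by continent with facet_wrap, assuming the ggplot2 and gapminder packages are installed.

```r
library(ggplot2)
library(gapminder)

# One panel per continent
p_facet <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_wrap(~continent)
p_facet
```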
The previous plot was a bit messy as it contained all combinations of year and continent. Let’s suppose we want our analysis to be a bit more focussed and disregard countries in Oceania (as there are only 2 in our dataset) and years between 1997 and 2002. We should know how to restrict the rows from the gapminder dataset using the filter function. Instead of filtering the data, creating a new data frame and constructing the plot from these new data, we can use the %>% operator to filter the data on the fly and pass the result directly to ggplot. Thus we don’t have to save a new data frame or alter the original data.
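A sketch of piping filtered data straight into ggplot, assuming the dplyr, ggplot2 and gapminder packages are installed. Only the Oceania restriction is shown; the year restriction is left out because its exact form is an assumption we would rather not make.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Filter on the fly, then plot; note %>% between dplyr steps and + between ggplot2 layers
p_focus <- filter(gapminder, continent != "Oceania") %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, col = continent)) +
  geom_point() +
  facet_wrap(~continent)
p_focus
```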
The summarise function can take any R function that takes a vector of values (i.e. a column from a data frame) and returns a single value. Some of the more useful functions include:
min minimum value
max maximum value
sum sum of values
mean mean value
sd standard deviation
median median value
IQR the interquartile range
n_distinct the number of distinct values
n the number of observations (Note: this is a special function that doesn’t take a vector argument, i.e. column)
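A minimal sketch using a few of these functions on the lifeExp column, assuming the dplyr and gapminder packages are installed. The result is a table with a single row.

```r
library(dplyr)
library(gapminder)

# Several single-value summaries at once; n() takes no arguments
summary_table <- summarise(gapminder,
                           min(lifeExp), max(lifeExp), mean(lifeExp), n())
summary_table
```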
It is also possible to summarise using a function that takes more than one value, i.e. from multiple columns. For example, we could compute the correlation between year and life expectancy. Here we also assign names to the table that is produced.
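A sketch of a summary that uses two columns and assigns names to the output, assuming the dplyr and gapminder packages are installed. The column names MeanLifeExp and YearLifeExpCor are invented for the example.

```r
library(dplyr)
library(gapminder)

# cor takes two columns and returns a single value
cor_table <- summarise(gapminder,
                       MeanLifeExp = mean(lifeExp),
                       YearLifeExpCor = cor(year, lifeExp))
cor_table
```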
However, it is not particularly useful to calculate such values from the entire table as we have different continents and years. The group_by function allows us to split the table into different categories, and compute summary statistics. We can group the data according to year and compute the
The countries we identify could then be used as the basis for a plot.
filter(gapminder, country %in% c("Rwanda","Zambia","Zimbabwe")) %>%
ggplot(aes(x=year, y=lifeExp,col=country)) + geom_line()
Joining
In many real life situations, data are spread across multiple tables or spreadsheets. Usually this occurs because different types of information about a subject, e.g. a patient, are collected from different sources. It may be desirable for some analyses to combine data from two or more tables into a single data frame based on a common column, for example, an attribute that uniquely identifies the subject.
dplyr provides a set of join functions for combining two data frames based on matches within specified columns. For those familiar with SQL, these operations are very similar to carrying out join operations between tables in a relational database.
As a toy example, let’s consider two data frames that contain the names of various band members, together with the bands they belong to and the instruments that they play:-
band_instruments
band_members
There are various ways in which we can join these two tables together. We will just consider the case of a “left join”.
Animated gif by Garrick Aden-Buie
left_join returns all rows from the first data frame regardless of whether there is a match in the second data frame. Rows with no match are included in the resulting data frame but have NA values in the additional columns coming from the second data frame.
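A minimal sketch of a left join on the two toy tables, which are bundled with dplyr. The by argument is given explicitly here; if it is left out, dplyr joins on the shared column and prints a message saying so.

```r
library(dplyr)

# Keep every row of band_members; add `plays` where a matching name exists
joined <- left_join(band_members, band_instruments, by = "name")
joined
```

Members with no entry in band_instruments are kept, with NA in the plays column.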
right_join is similar but returns all rows from the second data frame, regardless of whether there is a match in the first data frame.
right_join(band_members,band_instruments)
Joining, by = "name"
inner_join only returns those rows where matches could be made
inner_join(band_members,band_instruments)
Joining, by = "name"
Exercise (open-ended)
The file medal_table.csv contains data about how many medals have been won by various countries at the Beijing summer Olympics of 2008.
Read this csv file into R and join with the gapminder data from 2007
What interesting summaries / plots can you make from the data?